algod importer: Update sync on WaitForBlock error. #122
Conversation
Codecov Report
@@ Coverage Diff @@
## master #122 +/- ##
==========================================
+ Coverage 67.66% 70.37% +2.71%
==========================================
Files 32 36 +4
Lines 1976 2535 +559
==========================================
+ Hits 1337 1784 +447
- Misses 570 654 +84
- Partials 69 97 +28
... and 1 file with indirect coverage changes
Looks correct to me. I'm not sure I follow how this causes the pipeline to hang though.
Last I checked, if you stop/start the node it will have the last MaxAcctLookback deltas in cache (and even more rounds available). It will also run ahead MaxAcctLookback-1 rounds.
So unless that number is 1, the node/pipeline should make progress despite the sync round being 1 round lower than what we expect. And the pipeline would correctly update the sync round once it processed another round.
I don't totally understand it either. I'm guessing there is some sort of cooldown / warmup time when rounds are being processed very quickly. For the file processor each round is being processed in the 50-200µs range. I was able to confirm that the sync round needs to be set again (this is with MaxAcctLookback = 64):
👍
This looks correct to me.
I made suggestions about rewording some comments, possibly using errors.Join, and keeping a higher timeout value.
The following thought strikes me. During yesterday's standup it sounded like around 1 in 3 shutdowns of algod during catchup would result in this bug. So if one brings algod down and up enough times we could practically guarantee the bug. It may even be possible to simulate this issue reliably in our short-duration E2E tests.
const (
	retries = 5

var (
	waitForRoundTimeout = 5 * time.Second
Nit:
A more conservative timeout would be 45 seconds. I agree that we want to give conduit more determinism about the outcome of each call to the waitForBlock endpoint, so it's a good idea to make the call time out on its own terms rather than the endpoint's, as this PR does. On the other hand, we might still want the ability to keep 10 threads of the algod importer running concurrently after we've all caught up, and a 45 sec timeout would allow for that. If we narrow the timeout to 5 secs, we essentially only allow one or two algod importer threads to run at a time (probably only one, due to round time variability).
On the other hand, we can change the value as aggressively as in the PR, and if the need arises in the future to raise it back to 45 secs we can do it.
The low timeout was intended for responsiveness, basically when the node is stalled the timeout needs to elapse before the first recovery attempt. If there's a timeout I'm expecting the pipeline to retry the call.
The default retry count is 5, now I'm wondering if it should be unlimited.
The old Indexer had a package called fetcher, I wonder if we should bring that back to manage more optimal round caching: https://github.com/algorand/indexer/blob/master/fetcher/fetcher.go#L1
A worthwhile thought for a future PR or even the pipelining effort. (suggest keeping this thread unresolved for future reference)
I'll change the default retry timeout to 0 in a followup PR, it's probably a good default anyway since people have expressed appreciation for Indexer working that way.
Approving, even though I'm still curious if creating an E2E test is viable. That can be left as a future exercise.
Summary
If algod is restarted after it receives a sync round update but before it fetches the new round(s), then the algod follower and conduit will stall. Conduit will keep waiting for algod to reach the new sync round but it never happens.
This change adds some extra logic to the WaitForBlock call: if there is a timeout or a bad response, a new attempt to set the sync round is made.
This PR also removes the retry loop from the algod importer. Retry is now managed by the pipeline.
Test Plan
Update existing unit tests.